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Spatially proximate amino acids in a protein tend to coevolve. A protein's 3D structure hence 
leaves an echo of correlations in the evolutionary record. Reverse engineering 3D structures from 
such correlations is an open problem in structural biology, pursued with increasing vigor as more and 
more protein sequences continue to fill the data banks. Within this task lies a statistical inference 
problem, rooted in the following: correlation between two sites in a protein sequence can arise from 
firsthand interaction, but can also be network-propagated via intermediate sites; observed correlation 
is not enough to guarantee proximity. To separate direct from indirect interactions is an instance 
of the general problem of inverse statistical mechanics, where the task is to learn model parameters 
(fields, couplings) from observables (magnetizations, correlations, samples) in large systems. In 
the context of protein sequences, the approach has been referred to as direct- coupling analysis. 
Here we show that the pseudolikelihood method, applied to 21-state Potts models describing the 
statistical properties of families of evolutionarily related proteins, significantly outperforms existing 
approaches to the direct-coupling analysis, the latter being based on standard mean-field techniques. 
This improved performance also relies on a modified score for the coupling strength. The results 
are verified using known crystal structures of specific sequence instances of various protein families. 
Code implementing the new method can be found at http : //plmdca. esc .kt h. se/| 

PACS numbers: 02.50.Tt — Inference methods, 87.10.Vg - Biological information, 87.15.Qt - Sequence 
analysis, 87.14.E — Proteins 



I. INTRODUCTION 

In biology, new and refined experimental techniques 
have triggered a rapid increase in data availability during 
the last few years. Such progress needs to be accompa- 
nied by the development of appropriate statistical tools 
to treat growing data sets. An example of a branch un- 
dergoing intense growth in the amount of existing data 
is protein structure prediction (PSP), which, due to the 
strong relation between a protein's structure and its func- 
tion, is one central topic in biology. As we shall see, one 
can accurately estimate the 3D structure of a protein by 
identifying which amino-acid positions in its chain are 
statistically coupled over evolutionary time scales [H-0- 
Much of the experimental output is today readily acces- 
sible through public databases such as Pfam Q, which 
collects over 13,000 families of evolutionarily related pro- 
tein domains, the largest of them containing more than 
2 x 10 5 different amino-acid sequences. Such databases 
allow researchers to easily access data, to extract infor- 
mation from it, and to confront their results. 

A recurring difficulty when dealing with interacting 
systems is distinguishing direct interactions from inter- 
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actions mediated via multi-step paths across other ele- 
ments. Correlations are, in general, straightforward to 
compute from raw data, whereas parameters describing 
the true causal ties are not. The network of direct in- 
teractions can be thought of as hidden beneath observ- 
able correlations, and untwisting it is a task of inher- 
ent intricacy. In PSP, using mathematical means to 
dispose of the network-mediated correlations observable 
in alignments of evolutionarily related (and structurally 
conserved) proteins is necessary to get optimal results 
[H [H, and yields improvements worth the computa- 
tional strain put on the analysis. This approach to PSP, 
which we refer to as direct- coupling analysis (DCA), is 
the focus of this paper. 

In a more general setting, the problem of inferring 
interactions from observations of instances amounts to 
inverse statistical mechanics, sl field which has been 
intensively pursued in statistical physics over the last 
decade [10l - [23j . Similar tasks were formulated earlier in 
statistics and machine learning, where they have been 
called model learning and inference ■ To illustrate 

this concretely, let us start from an Ising model, 

1 l N \ 

P(cr 1 , . . . , a N ) = 77 exp I hjo-j + J^crA 

/j l<i<j<N J 

(1) 

and its magnetizations m, = log Z and connected cor- 
relations Cij = dj i} logZ — miirij. Counting the number 
of observables (rrij and Cij) and the number of parameters 
(hi and ) it is clear that perfect knowledge of the mag- 
netizations and correlations should suffice to determine 
the external fields and the couplings exactly. It is, how- 
ever, also clear that such a process must be computation- 
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ally expensive, since it requires the computation of the 
partition function Z for an arbitrary set of parameters. 
The exact but iterative procedure known as Boltzmann 
machines [28[ does in fact work on small systems, but it 
is out of question for the problem sizes considered in this 
paper. On the other hand, the mean-field equations of 
Q read [Hll|: 



tanh 



Jij m j- 



(2) 



From ^ and the fluctuation-dissipation relations an 
equation can be derived connecting the coupling coef- 
ficients Jij and the correlation matrix c = (cy) [lOj : 



Ji 



(3) 



Equations ([2]) and ([3]) exemplify typical aspects of in- 
verse statistical mechanics, and inference in large sys- 
tems in general. On one hand, the parameter recon- 
struction using these two equations is not exact. It is 
only approximate, because the mean-field equations © 
are themselves only approximate. It also demands rea- 
sonably good sampling, as the matrix of correlations is 
not invertible unless it is of full rank, and small noise 
on its 0(N 2 ) entries may result in large errors in esti- 
mating the J^. On the other hand, this method is fast, 
as fast as inverting a matrix, because one does not need 
to compute Z. Except for mean-field methods as in ((2J, 
approximate methods recently used to solve the inverse 
Ising problem can be grouped as expansion in correla- 
tions and clusters I la. 1 191 . methods based on the Bethe 
approximation (ItI Il8l l20H22j ] . and the pseudolikelihood 
method (HH^. 

For PSP, it is not the Ising model but a 21-state Potts 
model which is pertinent [l[ : The model shall be learned 
such that it represents the statistical features of large 
multiple sequence alignments of homologous (evolution- 
arily related) proteins, and to reproduce the statistics of 
correlated amino acid substitutions. This can be done 
with the Potts equivalent of Eq. (JTJ), i-e. using a model 
with pairwise interactions. As will be detailed below, 
strong interactions can be interpreted as indicators for 
spatial vicinity of residues in the three-dimensional pro- 
tein fold, even if residues are well separated along the 
sequence - thus linking evolutionary sequence statistics 
with protein structure. But which of all the inference 
methods in inverse statistical mechanics, machine learn- 
ing or statistics is most suitable for treating real protein 
sequence data? How do the test results obtained for inde- 
pendently generated equilibrium configurations of Potts 
models translate to the case of protein sequences, which 
are neither independent nor equilibrium configurations 
of any well-defined statistical-physics model? The main 
goal of this paper is to move towards an answer to this 
question by showing that the pseudolikelihood method is 
a very powerful means to perform DCA, with a predic- 
tion accuracy considerably out-performing methods pre- 
viously assessed. 



The paper is structured as follows: in Section UH we 
review the ideas underlying PSP and DCA and explain 
the biological hypotheses linking protein 3D structure to 
correlations in amino- acid sequences. We also review ear- 
lier approaches to DCA. In Section llrfl we describe the 
Potts model in the context of DCA and the properties 
of exponential families. We further detail a maximum- 
likelihood (ML) approach as brought to bear on the in- 
verse Potts problem and discuss in more detail why this 
is impractical for realistic system sizes, and we intro- 
duce, similarly to (J3J) above, the inverse Potts mean-field 
model algorithm for the DCA (mfDCA) and a pseudolike- 
lihood maximization procedure (plmDCA). This section 
also covers algorithmic details of both models such as reg- 
ularization and sequence reweighting. A further impor- 
tant issue is the selection of an interaction score, which 
reduces coupling matrices to a scalar score, and allows for 
ranking of couplings according to their 'strength'. In Sec- 
tion IIV1 we present results from prediction experiments 
using mfDCA and plmDCA assessed against known crys- 
tal structures. In Section [Vj we summarize our findings, 
put their implications into context, and discuss possible 
future developments. The appendixes contain additional 
material supporting the main text. 



II. PROTEIN STRUCTURE PREDICTION AND 
DIRECT-COUPLING ANALYSIS 

Proteins are essential players in almost all biological 
processes. Primarily, proteins are linear chains, each site 
being occupied by 1 out of 20 different amino acids. Their 
function relies, however, on the protein fold, which refers 
to the 3D conformation into which the amino-acid chain 
curls. This fold guarantees, e.g., that the right amino 
acids are exposed in the right positions at the protein 
surface, thus forming biochemically active sites, or that 
the correct pairs of amino acids are brought into contact 
to keep the fold thermodynamically stable. 

Experimentally determining the fold, using e.g. x-ray 
crystallography or NMR tomography, is still a rather 
costly and time-consuming process. On the other hand, 
every newly sequenced genome results in a large num- 
ber of newly predicted proteins. The number of se- 
quenced organisms now exceeds 3, 700, and continues 
to grow exponentially (genomesonline.org [32]). The 
most prominent database for protein sequences, Uniprot 
(uniprot.org [33]), lists about 25,000,000 different pro- 
tein sequences, whereas the number of experimentally 
determined protein structures is only around 85,000 
(pdb.org II). 

However, the situation of structural biology is not as 
hopeless as these numbers might suggest. First, proteins 
have a modular architecture; they can be subdivided into 
domains which, to a certain extent, fold and evolve as 
units. Second, domains of a common evolutionary ori- 
gin, i.e., so-called homologous domains, are expected to 
almost share their 3D structure and to have related func- 
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Contact Q 



Correlation 



FIG. 1. (Color online) Left panel: small MSA with two posi- 
tions of correlated amino-acid occupancy. Right panel: hypo- 
thetical corresponding spatial conformation, bringing the two 
correlated positions into direct contact. 



tion. They can therefore be collected in protein domain 
families: the Pfam database (pfam.sanger.ac.uk 
lists almost 14,000 different domain families, and the 
number of sequences collected in each single family ranges 
roughly from 10 2 to 10 5 . In particular the larger fami- 
lies, with more than 1,000 members, are of interest to 
us, as we argue that their natural sequence variability 
contains important statistical information about the 3D 
structure of its member proteins, and can be exploited to 
successfully address the PSP problem. 

Two types of data accessible via the Pfam database 
are especially important to us. The first is the multiple 
sequence alignment (MSA), a table of the amino acid 
sequences of all the protein domains in the family lined 
up to be as similar as possible. A (small and illustrative) 
example is shown in Fig. Q] (left panel). Normally, not all 
members of a family can be lined up perfectly, and the 
alignment therefore contains both amino acids and gaps. 
At some positions, an alignment will be highly specific 
(cf. the second, fully conserved column in Fig. [IJ , while 
at others it will be more variable. The second data type 
concerns the crystal structure of one or several members 
of a family. Not all families provide this second type of 
data. We discuss its use for an a posteriori assessment of 
our inference results in detail in Sec. IIV1 

The Potts-model based inference uses only the first 
data type, i.e., the sequence data. Small spatial separa- 
tion between amino acids in a protein, cf. the right panel 
of Fig. [TJ encourages co-occurrence of favorable amino- 
acid combinations, cf. the left panel of Fig. [I] This spices 
the sequence record with correlations, which can be re- 
liably determined in sufficiently large MSAs. However, 
the use of such correlations for predicting 3D contacts as 
a first step to solve the PSP problem remained of limited 
success [35M37I ] , since they can be induced both by direct 
interactions (amino acid A is close to amino acid B), and 
by indirect interactions (amino acids A and B are both 
close to amino acid C). Lapedes et al. [HI were the first 
to address, in a purely theoretical setting, these ambigui- 



ties of a correlation-based route to protein sequence anal- 
ysis, and these authors also outline a maximum-entropy 
approach to get at direct interactions. Weigt et al. [l[ 
successfully executed this program subsequently called 
direct- coupling analysis: the accuracy in predicting con- 
tacts strongly increases when direct interactions are used 
instead of raw correlations. 

To computationally solve the task of inferring inter- 
actions in a Potts model, [l| employed a generalization 
of the iterative message-passing algorithm susceptibility 
propagation previously developed for the inverse Ising 
problem Methods in this class are expected to out- 
perform mean-field based reconstruction methods similar 
to ([3]) if the underlying graph of direct interactions is lo- 
cally close to tree-like, an assumption which may or may 
not be true in a given application such as PSP. A sub- 
stantial draw-back of susceptibility propagation as used 
in [l[ is that it requires a rather large amount of auxiliary 
variables, and that DCA could therefore only be carried 
out on protein sequences that are not too long. In 0], 
this obstacle was overcome by using instead a simpler 
mean-field method, i.e., the generalization of ((3]) to a 21- 
state Potts model. As discussed in Q, this broadens the 
reach of the DCA to practically all families currently in 
Pfam. It improves the computational speed by a factor 
of about 10 3 -10 4 , and it appears also to be more accurate 
than the susceptibility-propagation based method of [l[ 
in predicting contact pairs. The reason behind this third 
advantage of mean-field over susceptibility propagation 
as an approximate method of DCA is unknown at this 
time. 

Pseudolikelihood maximization (PLM) is an alternative 
method developed in mathematical statistics to approxi- 
mate maximum-likelihood inference, which breaks down 
the a priori exponential time complexity of computing 
partition functions in exponential families [39|. On the 
inverse Ising problem it was first used by Ravikumar et 
al. [27| , albeit in the context of graph sign-sparsity recon- 
struction; two of us showed recently that it outperforms 
many other approximate inverse Ising schemes on the 
Sherrington-Kirkpatrick model, and in several other ex- 
amples [23|] . Although this paper reports the first use of 
the PLM method in DCA, the idea of using pseudolikeli- 
hoods for PSP is not completely novel. Balakrishnan et 
al. [|[ devised a version of this idea, but using a set-up 
rather different from that in [2j, regarding, for example, 
what portions of the data bases and which measures of 
prediction accuracy were used, and also, not couched in 
the language of inverse statistical mechanics. While a 
competitive evaluation between Q and § is still open, 
we have not attempted such a comparison in this work. 

Other ways of deducing direct interactions in PSP, not 
motivated from the Potts model but in somewhat simi- 
lar probabilistic settings, have been suggested in the last 
few years. A fast method utilizing Bayesian networks 
was provided by Burger and van Nimwegen [7|. More 
recently, Jones et al. Q introduced a procedure called 
PSICOV (Protein Sparse Inverse COVariance). While 
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DCA and PSICOV both appear capable of outperform- 
ing the Bayesian network approach 0, , their relative 
efficiency is currently open to investigation, and is not 
assessed in this work. 

Finally, predicting amino acid contacts is not only 
a goal in itself, but also a means to assemble protein 
complexes liol . l-ilj and to predict full 3D protein struc- 
tures [3j, |4j, |42[ . Such tasks require additional work, using 
the DCA results as input, and are outside the scope of 
the present paper. 

III. METHOD DEVELOPMENT 

A. The Potts model 

Let <r = (<7i, i72 j " ' • > &n) represent the amino acid se- 
quence of a domain with length N. Each Uj takes on 
values in {1,2, ...,<?}, with q = 21: one state for each of 
the 20 naturally occurring amino acids and one additional 
state to represent gaps. Thus, an MSA with B aligned 
sequences from a domain family can be written as an in- 
teger array {en 6 ' }^=i , with one row per sequence and one 
column per chain position. Given an MSA, the empirical 
individual and pairwise frequencies can be calculated as 

6=1 

fi j (k,l) = ^S(af\k)6(af\l), (4) 

6=1 

where 5(a, b) is the Kronecker symbol taking value 1 if 
both arguments are equal, and otherwise. fi(k) is hence 
the fraction of sequences for which the entry on position 
i is amino acid k, gaps counted as a 21st amino acid. 
Similarly, (fc, I) is the fraction of sequences in which 
the position pair holds the amino acid combination 
(k, I). Connected correlations are given as 

Cij(k,l) = fij(k,l) - fi(k) fj(l). (5) 

A (generalized) Potts model is the simplest probabilis- 
tic model P(<t) which can reproduce the empirically ob- 
served fi(k) and fij(k,l). In analogy to dJ) it is defined 

as 

1 l N \ 

P {°) = 7F CX P z2 h i(°i) + E J ii( a i> a j) > ( 6 ) 
\i=l l<i<j<N J 

in which hi(o~i) and Jij(<Ji,aj) are parameters to be de- 
termined through the constraints 

P{c H = k) = 53 P(tr) = 

cr 

P(<Ti = k, a, = I) = ^ P{rr) = f ZJ (k, I). (7) 



tj 




It is immediate that the probabilistic model which max- 
imizes the entropy while satisfying Eq. ([JJ must take 
the Potts model form. Finding a Potts model which 
matches empirical frequencies and correlations is there- 
fore referred to as a maximum- entropy inference (43ll44|. 
cf. also [H, [38[ f° r a formulation in terms of protein- 
sequence modeling. 

On the other hand, we can use Eq. |S| as a varia- 
tional ansatz and maximize the probability of the input 
data set with respect to the model parameters 

hiia) and Jij(a,o~'); this maximum-likelihood perspective 
is used in the following. We note that the Ising and the 
Potts models (and most models which are normally con- 
sidered in statistical mechanics) are examples of expo- 
nential families, and have the property that means and 
correlations are sufficient statistics [45144 7| . Given un- 
limited computing power to determine Z, reconstruction 
can not be done better using all the data compared to us- 
ing only (empirical) means and (empirical) correlations. 
It is only when one cannot compute Z exactly and has 
to resort to approximate methods, that using directly all 
the data can bring any advantage. 



B. Model parameters and gauge invariance 

The total number of parameters of Eq, © is Nq + 
N ( N ^ 1 ') g 2 ; brrt^ in fact, the model as it stands is over- 
parameterized in the sense that distinct parameter sets 
can describe the same probability distribution. It is easy 
to see that the number of nonrcdundant parameters is 
N(q - 1) + N{N ~ 1) {q - l) 2 , cf. an Ising model (q = 2), 
which has jV ^ +1 ' 1 parameters if written as in Eq. (JXJ) 
but would have 2N 2 parameters if written in the form of 
Eq. ©. 

A gauge choice for the Potts model, which eliminates 
the overparametrization in a similar manner as in the 
Ising model (and reduces to that case for q = 2), is 

9 q i 

s— 1 s—1 s—1 

for all i, j(> i), k, and I. Alternatively, we can choose a 
gauge where one index, say i = q, is special, and measure 
all interaction energies with respect to this value, i.e., 

J ij (q,l) = J ij (k,q) = h i (q) = 0, (9) 

for all i, j(> i), k, and I, cf. [2j. This gauge choice 
corresponds to a lattice gas model with q — 1 different 
particle types, and a maximum occupation number one. 

Using either © or © reconstruction is well-defined, 
and it is straight-forward to translate results obtained in 
one gauge to the other. 
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C. The inverse Potts problem 



E. Pseudolikelihood maximization 



Given a set of independent equilibrium configurations 
of the model Eq. ©, the ordinary statistical 
approach to inferring parameters {h, J} would be to let 
those parameters maximize the likelihood (i.e., the prob- 
ability of generating the data set for a given set of pa- 
rameters). This is equivalent to minimizing the (rescaled) 
negative log-likelihood function 



6=1 



(10) 



(11) 



For the Potts model ([6]), this becomes 

N q 

l(h, J) = log Z-Y,j2fi(k)hi(k) 

4=1 fc=l 

- E E h(k,i)Jij(k,i). 

l<i<j<N kd=l 

I is diffcrcntiable, so minimizing it means looking for a 
point at which d^j-^l = and dj^fu^l = 0. Hence, ML 
estimates will satisfy 

dhi(k) logZ- fi(k) = 0, 

d Ji . {kil) logZ-f ij (k,l) = 0. (12) 

To achieve this minimization computationally, we need to 
be able to calculate the partition function Z of Eq. (O for 
any realization of the parameters {h, J}. This problem 
is computationally intractable for any reasonable system 
size. Approximate minimization is essential, and we will 
show that even relatively simple approximation schemes 
lead to accurate PSP results. 



D. Naive mean-field inversion 

The mfDCA algorithm in is based on the simplest 
and computationally most efficient approximation, i.e., 
naive mean-field inversion (NMFI). It starts from the 
proper generalization of ([2]), cf. [48], and then uses lin- 
ear response: The J's in the lattice-gas gauge Eq. © 
become: 

J% MPI (k,l) = -(C-\ b , (13) 

where a = (q — l)(i — 1) + k and b = (q — — 1) + 1. 
The matrix C is the N(q — 1) x N(q — 1) covariance ma- 
trix assembled by joining the N(q — 1) x N(q — 1) values 
Cij(k,l) as defined in Eq. ([5]), but leaving out the last 
state q. In Eq. flT3"|) . i,j £ {1,...,N} are site indices, 
and k, I run from 1 to q — 1. Under gauge Eq. ([9]), all 
the other coupling parameters are zero. The term 'naive' 
has become customary in the inverse statistical mechan- 
ics literature, often used to highlight the difference to a 
Thouless- Anderson-Palmer level inversion or one based 
on the Bethe approximation. The original meaning of 
this term lies, as far as we are aware, in information ge- 
ometry mia]. 



Pseudolikelihood substitutes the probability in (fTOj) by 
the conditional probability of observing one variable a r 
given observations of all the other variables <x\ r . That 
is, the starting point is 



exp I h r (a i r' > ) + Jri{vr 0> , ° 



N 



( & ) J b )\ 



i=l 



(14) 



N 



i=i 



where, for notational convenience, we take J r i(l,k) to 
mean Ji r (k, I) when i < r. Given an MSA, we can maxi- 
mize the conditional likelihood by minimizing 



g r (h r ,3 r 



B 



TlE 1 " 

fc=i 



P> 



{h,.,J 



.}(o- r 



\r »■ 



(15) 

Note that this only depends on h r and J r = {Ji r }i^r, 
i.e., on the parameters featuring node r. If (|15[) is used 
for all r this leads to unique values for the parameters h> 
but typically different predictions for 3 rq and 3 qr (which 
should be the same). Minimizing (|15l) must therefore be 
supplemented by some procedure on how to reconcile dif- 
ferent values of 3 rq and 3 qr ; one way would be to simply 



use their average \{3 rq + 3 qr ) [2 
We here reconcile different J 



rq and 3 qr by minimizing 



an objective function adding g r for all nodes: 



N 



IpseudoO^-t J) — ^ ] ffr(hy, Jr) 



(16) 



r=l 



N B 



r=l 6=1 

Minimizcrs of l pse udo generally do not minimize I; the re- 
placement of likelihood with pseudolikelihood alters the 
outcome. Note, however, that replacing / by l pse udo re- 
solves the computational intractability of the parameter 
optimization problem: instead of depending on the full 
partition function, the normalization of the conditional 
probability (|14p contains only a single summation over 
the q — 21 Potts states. The intractable average over 
the N — 1 conditioning spin variables is replaced by an 
empirical average over the data set in Eq. ([T( 



F. Regularization 



A Potts model describing a protein family with se- 
quences of 50-300 amino acids requires ca. 5 • 10 5 to 
2- 10 parameters. At present, few protein families are in 
this range in size, and regularization is therefore needed 
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to avoid overfitting. In NMFI, the problem results in an 
empirical covariance matrix which typicallyis not of full 
rank, and Eq. (fT3| is not well-defined. In [2|, one of the 
authors therefore used the pseudocount method where 
frequencies and empirical correlations are adjusted using 
a regularization variable A: 



fi(k) = 



fij (&> — 



X + B 
1 

X + B 



6=1 
B 



(17) 



4 + !>< W .*) S(af\l) 

1 6=1 



The pseudocount is a proxy for many observations, which 
- if they existed - would increase the rank of the corre- 
lation matrix; the pseudocount method hence promotes 
invcrtibility of the matrix in Eq. (|13l) . It was observed in 
Q that for good performance in DCA, the pseudocount 
parameter A has to be taken fairly large, on the order of 
B. 

In the PLM method, we take the standard route of 
adding a penalty term to the objective function: 

{h PLM ,J PLM } = argmin{/ pseudo (h,J)+i?(h,J)}. (18) 
{h.J} 

The turnout is then a trade-off between likelihood 
maximization and whatever qualities R is pushing for. 
Ravikumar et al. |27| pioneered the use of l\ regulariz- 
es for the inverse Ising problem, which forces a fraction 
of parameters to assume value zero, thus effectively re- 
ducing the number of parameters. This approach is not 
appropriate here since we are concerned with the accu- 
racy (resp. divergence) of the strongest predicted cou- 
plings; for our purposes it makes no substantial differ- 
ence whether weak couplings are inferred to be small or 
set precisely to 0. Our choice for R is therefore the sim- 
pler I2 norm 



JV 



R l2 (h,3) = X h J2\\K\\l + K 



r=l 



E 

l<i<j<N 



1 1 2 ' (19) 



using two regularization parameters Xh and Xj for field 
and coupling parameters. An advantage of a regular- 
izer is that it eliminates the need to fix a gauge, since 
among all parameter sets related by a gauge transforma- 
tion, i.e., all parameter sets resulting in the same Potts 
model, there will be exactly one set which minimizes 
the strictly convex regularizer. For the case of the 1% 
norm, it can be shown that this leads to a gauge where 

ELi-MM = &ht(k), ELi^M = !^-(0, and 

Es=l hi(s) = f° r an h j(> *)j an d I- 

To summarize this discussion: For NMFI, we regu- 
larize with pseudocounts under the gauge constraints 
Eq. ©• For PLM, we regularize with Ri 2 under the full 
parametrization. 



G. Sequence reweighting 

Maximum-likelihood inference of Potts models relies 
- as discussed above - on the assumption that the B 
sample configurations in our data set are independently 
generated from Eq. (j6)). This assumption is not correct 
for biological sequence data, which have a phylogenetic 
bias. In particular, in the databases there are many pro- 
tein sequences from related species, which did not have 
enough time of independent evolution to reach statistical 
independence. Furthermore, the selection of sequenced 
species in the genomic databases is dictated by human 
interest, and not by the aim to have an as independent 
as possible sampling in the space of all functional amino- 
acid sequences. A way to mitigate effects of uneven sam- 
pling, employed in Q, is to equip each sequence cr^ with 
a weight Wb which regulates its impact on the parameter 
estimates. Sequences deemed unworthy of independent- 
sample status (too similar to other sequences) can then 
have their weight lowered, whereas sequences which are 
quite different from all other sequences will contribute 
with a higher weight to the amino-acid statistics. 

A simple but efficient way (cf. [2]) is to measure 
the similarity sim(er( a ), cr^) of two sequences er( a ) and 
<x( b ) as the fraction of conserved positions (i.e., identical 
amino acids), and compare this fraction to a preselected 
threshold x , < x < 1. The weight given to a sequence 
er( b ) can then be set to Wb = where nib is the number 

of sequences in the MSA similar to trW: 

m b = \{a 6 {1, B) : sim(er( a ), er^) > x}\. (20) 

In a suitable threshold x was found to be 0.8, re- 
sults only weakly dependent on this choice throughout 
0.7 < x < 0.9. We have here followed the same procedure 
using threshold x — 0.9. The corresponding reweighted 
frequency counts then become 



fi(k) 



fij 



1 



A + B eff 
1 



6=1 



(21) 



X + B, 



eff 



X 

~+2 



6=1 



where B e f f — ^2 b =i Wb becomes a measure of the number 
of effectively nonredundant sequences. 

In the pseudolikelihood we use the direct analog of 
Eq. dHJ, i.e., 



(22) 



IpseudoQ^-i J) 

R 

1 



P E Wfc E l0g P {K-,3r}i a r = ^V\r = <Tw) 
J 6=1 r—1 



As in the frequency counts, each sequence is considered 
to contribute a weight Wb , instead of the standard weight 
one used in i.i.d. samples. 



7 



H. Interaction scores 

In the inverse Ising problem, each interaction is scored 
by one scalar coupling strength Jij. These can easily 
be ordered, e.g. by absolute size. In the inverse Potts 
problem, each position pair is characterized by a 

whole q x q matrix J,y , and some scalar score <Sy is needed 
in order to evaluate the 'coupling strength' of two sites. 

In and the score used is based on direct infor- 
mation (DI), i.e., the mutual information of a restricted 
probability model not including any indirect coupling ef- 
fects between the two positions to be scored. Its con- 
struction goes as follows: For each position pair 
(the estimate of) is used to set up a 'direct distribu- 
tion' involving only nodes i and j, 



(dir) 



(k, l) ~ exp (Jij(k, I) + h^ k + h' jt i) 



(23) 



h! i k and h'j ; are new fields, computed as to ensure agree- 
ment of the marginal single-site distributions with the 
empirical individual frequency counts fi(k) and fj(l). 
The score S® 1 is now calculated as the mutual infor- 
mation of P (dlr ^. 



E 

k,l=X 



^f r) (M)iog 



/<(*)£ (0 



(24) 



A nice characteristic of SB 1 is its invariance with re- 
spect to the gauge freedom of the Potts model, i.e., both 
choices Eqs. ([5]) and © (or any other valid choice) gen- 
erate identical S^ 1 . 

In the pseudolikelihood approach, we prefer not to use 
Sfj 1 , as this would require a pseudocount A to regularize 
the frequencies in (I24[) . introducing a third regulariza- 
tion variable in addition to \h and Aj. Another possible 
scoring function, already mentioned but not used in [![, 
is the Frobenius norm (FN) 



>ij\\2 



\ 



E J ^iy 



(25) 



Unlike S® 1 , ([25jl is not independent of gauge choice, so 
one must be a bit careful. As was noted in [l|, the zero 
sum gauge ([5]) minimizes the Frobenius norm, in a sense 
making ([8]) the most appropriate gauge choice for the 
score ([23)) . Recall from above that our pseudolikelihood 
uses the full representation and fixes the gauge by the 
regularization terms Ri 2 . Our procedure is therefore to 
first infer the interaction parameters using the pseudo- 
likelihood and the regularization, and then to change to 
the zero-sum gauge: 

4(fc, = Jij(k, l) - J iS (; I) - Jij(k, •) + JyO, •), (26) 

where '•' denotes average over the concerned position. 
One can show that (|26p preserves the probabilities of © 



(after altering the fields appropriately) and that J^(fe,i) 
satisfy (JSJ). A possible Frobenius norm score is hence 



-■FN 



IIJ', 



\ 



E 4(M 2 



(27) 



Lastly, we borrow an idea from Jones et al. [9j], whose 
PSICOV method also uses a norm rank (1-norm instead 
of Frobenius norm), but adjusted by an average-product 
correction (APC) term. APC was introduced in [H| to 
suppress effects from phylogenetic bias and insufficient 
sampling. Incorporating also this correction, we have 
our scoring function 



FN cFN 



s 



CN 



s, 



cFN c 
FN -j °i 



S FN 



(28) 



where CN stands for 'corrected norm'. An imple- 
mentation of plmDCA in MATLAB is available at 
http : //plmdca. esc . kth. se/l 



IV. EVALUATING THE PERFORMANCE OF 
MFDCA AND PLMDCA ACROSS PROTEIN 
FAMILIES 

We have performed numerical experiments using 
mfDCA and plmDCA on a number of domain families 
from the Pfam database; here we report and discuss the 
results. 



A. Domain families, native structures, and 
true-positive rates 

The speed of mfDCA enabled Morcos et al. [2j to 
conduct a large-scale analysis using 131 families. PLM 
is computationally more demanding than NMFI, so we 
chose to start with a smaller collection of 17 families, 
listed in Table HI To ease the numerical effort, we chose 
families with relatively small N and B. More precisely, 
we selected families out of the first 115 Pfam entries (low 
Pfam ID), which have (i) at most N — 130 residues, (ii) 
between 2,000 and 22,000 sequences, and (Hi) reliable 
structural information (cf. the PDB entries provided in 
the table). No selection based on DCA performance was 
done. In the appendix, a number of longer proteins is 
studied. The mfDCA performance on the selected fami- 
lies was found to be coherent with the larger data set of 
Morcos et al. 

To reliably assess how good a contact prediction is, 
something to regard as a 'gold standard' is helpful. For 
each of the 17 families we have therefore selected one rep- 
resentative high-resolution X-ray crystal structure (reso- 
lution below 3A), see Table U for the corresponding PDB 
identification. 

From these native protein structures, we have ex- 
tracted position-position distances d(i,j) for each pair of 
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Family ID 


N 


B 


Beff (90%) 


PDB ID 


UniProt entry 


UniProt residues 


PF00011 


102 


5024 


3481 


2bol 


TSP36_TAESA 


106-206 


PF00013 


58 


6059 


3785 


lwvn 


PCBP1JHUMAN 


281-343 


PF00014 


53 


2393 


1812 


5pti 


BPTUBOVIN 


39-91 


PF00017 


77 


2732 


1741 


lo47 


SRCJHUMAN 


151-233 


PF00018 


48 


5073 


3354 


2hda 


YESJHUMAN 


97-144 


PF00027 


91 


12129 


9036 


3fhi 


KAP0.BOVIN 


154-238 


PF00028 


93 


12628 


8317 


2o72 


CADH1JHUMAN 


267-366 


PF00035 


67 


3093 


2254 


loOw 


RNC.THEMA 


169-235 


PF00041 


85 


15551 


10631 


lbqu 


IL6RB .HUMAN 


223-311 


PF00043 


95 


6818 


5141 


6gsu 


GSTM1_RAT 


104-192 


PF00046 


57 


7372 


3314 


2vi6 


NANOGJVIOUSE 


97-153 


PF00076 


70 


21125 


14125 


lg2e 


ELAV4JHUMAN 


48-118 


PF00081 


82 


3229 


1510 


3bfr 


SODM.YEAST 


27-115 


PF00084 


56 


5831 


4345 


lelv 


C1S.HUMAN 


359-421 


PF00105 


70 


2549 


1277 


lgdc 


GCR.RAT 


438-507 


PF00107 


130 


17864 


12114 


la71 


ADH1EJHORSE 


203-338 


PF00111 


78 


7848 


5805 


la70 


FER1J3PIOL 


58-132 



TABLE I. Domain families included in our study, listed with Pfam ID, length N, number of sequences B (after removal of 
duplicate sequences), number of effective sequences B e jj (under x — 0.9, i.e., 90% threshold for reweighting), and the PDB 
and UniProt specifications for the structure used to access the DCA prediction quality. 




FIG. 2. (Color online) Histograms of crystal-structure distances pooled from all 17 families, when including all pairs (left), 
pairs with \j — i\ > 4 (center), and pairs with \j — i\ > 14 (right). The vertical line is our contact cutoff 8.5A. 



sequence positions, by measuring the minimal distance 
between any two heavy atoms belonging to the amino 
acids present in these positions. The left panel of Fig. [2] 
shows the distribution of these distances in all considered 
families. Three peaks protrude from the background dis- 
tribution: one at small distances below 1.5 A, a second at 
about 3-5A and a third at about 7-8A. The first peak cor- 
responds to the peptide bonds between sequence neigh- 
bors, whereas the other two peaks correspond to non- 
trivial contacts between amino acids, which may be dis- 
tant along the protein backbone, as can be seen from the 
center and right panels of Fig. [3J which collect only dis- 
tances between positions i and j with minimal separation 
\j — i\ > 5 resp. \j — i\ > 15. Following we take the 



peak at 3-5A to presumably correspond to short-range 
interactions like hydrogen bonds or secondary-structure 
contacts, whereas the last peak likely corresponds to 
long-range, possibly water-mediated interactions. These 
peaks contain the nontrivial information we would like 
to extract from sequence data using DCA. In order to 
accept the full second peak, we have chosen a distance 
cutoff of 8.5A for true contacts, slightly larger than the 
value of 8 A used in 0- 

Accuracy results are here reported primarily using 
true-positive (TP) rates, also the principal measurement 
in Q and @ . The TP rate for p is the fraction of the p 
strongest-scored pairs which are actually contacts in the 
crystal structure, defined as described above. To exem- 
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plify TP rates, let us jump ahead and look at Fig. [3l For 
PLM and protein family PF00076, the TP rate is 1 up 
to p = 80, which means that all 80 top-S^ N pairs are 
genuine contacts in the crystal structure. At p = 200, 
the TP rate has dropped to 0.78, so 0.78 • 200 = 156 of 
the top 200 top-Sf< N pairs are contacts, while 44 are not. 



B. Parameter settings 

To set the stage for comparison, we started by running 
initial trials on the 17 families using both NMFI and 
PLM with many different regularization and reweighting 
strengths. Reweighting indeed raised the TP rates, and, 
as reported in Q for 131 families, results seemed robust 
toward the exact choice of the limit x around 0.7 < x < 
0.9. We chose x = 0.9 to use throughout the study. 

In what follows, NMFI results are reported using the 
same list of pseudocounts as in Fig. SI 1 in 2]: A = 
w-B eff with to = {0.11, 0.25, 0.43, 0.67, 1.0, 1.5, 2.3, 4.0, 
9.0}. During our analysis we also ran intermediate values, 
and we found this covering to be sufficiently dense. We 
give outputs from two versions of NMFI: NMFI-DI and 
NMFI-DI(true). The former uses pseudocounts for all 
calculations, whereas the latter switches to true frequen- 
cies when it gets to the evaluations of S^ 1 . We append 
'DP to the NMFI name, since, later on, we will also try 
the Sf 3 N score for NMFI (which we will call NMFI-CN) . 

With I2 regularization in the PLM algorithm, out- 
comes were robust against the precise choice of A^; TP 
rates were almost identical when Xh was changed between 
0.001 and 0.1. We therefore chose Xh = 0.01 for all ex- 
periments. What mattered, rather, was the coupling reg- 
ularization parameters A,/, for which we did a systematic 
scan from A ,7 = and up using step-size 0.005. 

So, to summarize, the results reported here are based 
oni = 0.9, cutoff 8.5A, Xh = 0.01, and A and Aj drawn 
from collections of values as described above. 



C. Main comparison of mfDCA and plmDCA 

Fig. [3] shows TP rates for the different families and 
methods. We see that the TP rates of plmDCA (PLM) 
are consistently higher than those of mfDCA (NMFI), 
especially for families with large B e ff. For what con- 
cerns the two NMFI versions: NMFI-DI(true) avoids the 
strong failure seen in NMFI-DI for PF00084, but for most 
other families, see in particular PF00014 and PF00081, 
the performance instead drops using marginals without 
pseudocounts in the S® 1 calculation. For both NMFI-DI 
and NMFI-DI (true), the best regularization was found to 
be A = 1 • B e ff, the same value as used in [2j. For PLM, 
the best parameter choice was Xj = 0.01. Interestingly, 
this same regularization parameter was optimal for ba- 
sically all families. This is somewhat surprising, since 
both N and B e ff span quite wide ranges (48-130 and 
1277-14125 respectively). 



PF00076 
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FIG. 3. (Color online) Contact-detection results for the 17 
families, sorted by B e ff. The y-axes are TP rates and x- 
axes are the number of predicted contacts p, based on pairs 
with \j — i\ > 4. The three curves for each method are the 
three regularization levels yielding highest TP rates across all 
families. The thickened curve highlights the best one out of 
these three (A = B eff for NMFI and Xj = 0.01 for PLM). 



In the following discussion, we leave out all results for 
NMFI-DI (true) and focus on PLM vs. NMFI-DI, i.e., the 
version used in 0. All plots remaining in this section use 
the optimal regularization values: A = B e ff for NMFI 
and Xj = 0.01 for PLM. 

TP rates only classify pairs as contacts (d(i,j) < 8.5 A) 
or noncontacts (d(i,j) > 8.5A). To give a more detailed 
view of how scores correlate with spatial separation, we 
show in Fig. [J] a scatter plot of the score vs. distance for 
all pairs in all 17 families. PLM and NMFI-DI both man- 
age to detect the peaks seen in the true distance distribu- 
tion of Fig. [5J in the sense that high scores are observed 
almost exclusively at distances below 8.5A. Both meth- 
ods agree that interactions get, on average, progressively 
weaker going from peak one, to two, to three, and finally 
to the bulk. We note that the dots scatter differently 
across the PLM and NMFI-DI figures, reflecting the two 
separate scoring techniques: S^ 1 are strictly nonncga- 
tive, whereas APC corrected norms can assume negative 
values. We also observe how sparse the extracted signal 
is: most spatially close pairs do not show elevated scores. 
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FIG. 4. (Color online) Score plotted against distance for all position pairs in all 17 families for PLM (left) and NMFI-DI (right). 
The vertical line is our contact cutoff at 8.5A. 



However, from the other side, almost all strongly coupled 
pairs are close, so the biological hypothesis of Sec. [XT] is 
well supported here. 

Fig.[5]shows scatter plots of scores for PLM and NMFI- 
DI for some selected families. Qualitatively the same pat- 
terns were observed for all families. The points are clearly 
correlated, so, to some extent, PLM and NMFI-DI agree 
on the interaction strengths. Due to the different scoring 
schemes, we would not expect numerical coincidence of 
scores. Many of PLM's top-scoring position pairs have 
also top scores for NMFI-DI and vice versa. The largest 
discrepancy is in how much more strongly NMFI-DI re- 
sponds to pairs with small \j — i\; the crosses tend to 
shoot out to the right. PLM agrees that many of these 
neighbor pairs interact strongly, but, unlike NMFI-DI, it 
also shows rivaling strengths for many \j — i\ > 4-pairs. 

An even more detailed picture is given by considering 
contact maps, see Fig. [51 The tendency observed in the 
last scatter plots remains: NMFI-DI has a larger por- 
tion of highly scored pairs in the neighbor zone, which 
are the middle stretches in these figures. An important 
observation is, however, that clusters of contacting pairs 
with long ID sequence separation are captured by both 
algorithms. 

Note, that only a relatively small fraction of contacts 
is uncovered by DCA, before false positives start to ap- 
pear, and that many native contacts are missed. How- 
ever, the aim of DCA cannot be to reproduce a com- 
plete contact map: It is based on sequence correlations 
alone (e.g. it cannot predict contacts for 100% conserved 
residues), it does not consider any physico-chemical prop- 
erty of amino acids (they are seen as abstract letters), it 
does not consider the need to be embeddable in 3D. Fur- 
thermore, proteins may undergo conformational changes 
(e.g. allostery) or homo-dimerize, so coevolution may be 



induced by pairs which are not in contact in the X-ray 
crystal structure used for evaluating prediction accuracy. 
The important point - and this seems to be a distinguish- 
ing feature of maximum-entropy models as compared to 
simpler correlation-based methods - is to find a sufficient 
number of well-distributed contacts to enable de-novo 3D 
structure prediction 0-0, |40T - |42| . In this sense, it is more 
important to achieve high accuracy of the first predicted 
contacts, than intermediate accuracy for many predic- 
tions. 

In summary, the results suggest that the PLM method 
offers some interesting progress compared to NMFI. How- 
ever, let us also note that in the comparison we also had 
to change both scoring and regularization styles. It is 
thus conceivable that a NMFI with the new scoring and 
regularization could be more competitive with PLM. In- 
deed, upon further investigation, detailed in Appendix IB] 
we found that part of the improvement in fact does stem 
from the new score. In Appendix [C] where we extend our 
results to longer Pfam domains, we therefore add results 
from NMFI-CN, an updated version of the code used in 
[H which scores by S£ N instead of S^ 1 . 



D. Run times 

In general, NMFI, which is merely a matrix inversion, 
is very quick compared with PLM; most families in this 
study took only seconds to run through the NMFI code. 

In contrast to the message-passing based method used 
Q, a DCA using PLM is nevertheless feasible for all 
protein families in Pfam. The objective function in PLM 
is a sum over nodes and samples and its execution time 
is therefore expected to depend both on B (number of 
members of a protein family in Pfam) and N (length of 
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FIG. 5. (Color online) Scatter plots of interaction scores for 
PLM and NMFI-DI from four families. For all plots, the axes 
are as indicated by the top left figure. The distance unit in 
the top box is A. 



the aligned sequences in a protein family). 

On a standard desktop computer, using a basic 
MATLAB-interfaced C-implementation of conjugate gra- 
dient descent, the run times for PF00014, PF00017, and 
PF00018 (small N and B) were 50, 160 and 90 respec- 
tively. For PF00041 (small N but larger B) one run 
took 15 min. For domains with larger N, like those in 
Appendix [C] run times grow approximately apace. For 
example, the run times for PF00026 (N = 314) and 
PF00006 (N — 215) were 80 and 65 min respectively. 

A well-known alternative use of pseudolikelihoods is to 
minimize each g r separately. While slightly more crude, 
such an 'asymmetric' variant of plmDCA would amount 
to N independent (multiclass) logistic regression prob- 
lems, which would make parallel execution trivial on up 
to N cores. A rigorous examination of the asymmetric 
version's performance is beyond the scope of the present 
work, but initial tests suggest that even on a single pro- 
cessor it requires convergence times almost an order of 
magnitude smaller than the symmetric one (which we 
used in this work) , while still yielding almost exactly the 
same TP rates. Using N processors, the above mentioned 
run times could thus, in principle, be dropped by a factor 



FIG. 6. (Color online) Predicted contact maps for PLM (right 
part, black symbols) and NMFI-DI (left part, red symbols 
online) for four families. A pair (i,j)'s placement in the plots 
is found by matching positions i and j on the axes. Native 
contacts are indicated in gray. True and false positives are 
represented by circles and crosses, respectively. Each figure 
shows the 1.5JV strongest ranked pairs (including neighbors) 
for that family. 

as large as lOiV, suggesting that plmDCA can be made 
competitive not only in terms of accuracy, but also in 
terms of computational speed. 

Finally, all of these times were obtained cold-starting 
with all fields and couplings at 0. Presumably, one can 
improve by using an appropriate initial guess obtained, 
say, from NMFI. This has however not been implemented 
here. 



V. DISCUSSION 

In this work, we have shown that a direct-couping anal- 
ysis built on pseudolikelihood maximization (plmDCA) 
consistently outperforms the previously described mean- 
field based analysis (mfDCA), as assessed across a num- 
ber of large protein-domain families. The advantage of 
the pseudolikelihood approach was found to be partially 
intrinsic, and partly contingent on using a sampling- 
corrected Frobenius norm to score inferred direct statis- 
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tical coupling matrices. 

On one hand, this improvement might not be sur- 
prising: it is known that for very large data sets PLM 
becomes asymptotically equivalent to full maximum- 
likelihood inference, whereas mean-field inference re- 
mains intrinsically approximate, and this may result in 
an improved PLM performance also for finite data sets 

On the other hand, the above advantage holds if and 
only if the following two conditions are fulfilled: Data 
are drawn independently from a probability distribution, 
and this probability distribution is the Boltzmann dis- 
tribution of a Potts model. None of these two con- 
ditions actually hold for real protein sequences. On 
artificial data, refined mean-field methods (Thouless- 
Anderson-Palmer equations, Bethe approximation) also 
lead to improved model inference as compared to NMFI, 
cf. e.g. [lj, [H, [13, H3, but no such improvement has 
been observed in real protein data [2|. The results of 
the paper are therefore interesting and highly nontriv- 
ial. They also suggest that other model-learning meth- 
ods from statistics such as 'Contrastive Divergence' J53 
or the more recent 'Noise-Contrastive Estimation' [53j . 
could be explored to further increase our capacity to ex- 
tract structural information from protein sequence data. 

Disregarding the improvements, we find that overall 
the predicted contact pairs for plmDCA and mfDCA are 
highly overlapping, illustrating the robustness of DCA 
results with respect to the algorithmic implementation. 
This observations suggests that, in the context of model- 



ing the sequence statistics by pairwise Potts models, most 
extractable information might already be extracted from 
the MSA. However, it may well also be that there is al- 
ternative information hidden in the sequences, for which 
we would need to go beyond pairwise models, or inte- 
grate the physico-chemical properties of different amino 
acids into the procedure, or extract even more informa- 
tion from large sets of evolutionarily related amino-acid 
sequences. DCA is only a step in this direction. 

In our work we have seen that simple sampling cor- 
rections, more precisely sequence reweighting and the 
average-product correction of interaction scores, lead to 
an increased accuracy in predicting 3D contacts of amino 
acids, which are distant on the protein's backbone. It is, 
however, clear that these somewhat heuristic statistical 
fixes cannot correct for the complicated hierarchical phy- 
logenetic relationships between proteins, and that more 
sophisticated methods would be needed to disentangle 
phylogenetic from functional correlations in massive se- 
quence data. To do so is an open challenge, which would 
leave the field of equilibrium inverse statistical mechan- 
ics, but where methods of inverse statistical mechanics 
may still play a useful role. 
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Appendix A: Circle plots 

To get a sense of how false positives distribute across 
the domains, we draw interactions into circles in Fig. [7] 
Among erroneously predicted contacts there is some ten- 
dency towards loopiness, especially for NMFI-DI; the 
black lines tend to 'bounce around' in the circles. It 
hence seems that relatively few nodes are responsible for 
many of the false positives. We performed an explicit 
check of the data columns belonging to these 'bad' nodes, 
and we found that they often contained strongly biased 
data, i.e., had a few large fi(fc). In such cases, it seemed 
that NMFI-DI was more prone than PLM to report a 
(predicted) interaction. 



Crystal structure PLM NMFI-DI 




FIG. 7. (Color online) Connections for four families overlaid 
on circles. Position '1' is indicated by a dash. The leftmost 
column shows contacts in the crystal structure (dark gray for 
d(i, j) < 5A and light gray for 5A< d(i, j) < 8.5A). The other 
two columns show the top 1.5JV strongest ranked \j — i\ > 4- 
pairs for PLM and NMFI-DI, with gray for true positives and 
black for false positives. 



14 



Appendix B: Other scores for naive mean-field 
inversion 



is in fact about the same for both methods. Still, PLM 
maintains a somewhat higher TP rates overall. 



We also investigated NMFI performance using the 
APC term for the S® 1 scoring and using our new S£ N 
score. In the second case we first switch the parameter 
constraints from © to © using (|26p . Mean TP rates us- 
ing the modified score are shown in Fig. HI We observe 
that APC in S® 1 scoring increases TP rates slightly, 
while Sj? N scoring can improve TP rates overall. We 



remark, however, that for the second-highest ranked in- 

h the original 
(NMFI-CN). 
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FIG. 8. (Color online) Mean TP rates, using pairs with Jj — 
i\ > 4, for NMFI with old score S? 1 , new APC score S% DI , 
and the norm score S^ N . Each curve corresponds to the best 
A for that particular score. 

Motivated by the results of Fig. [SJ we decided to com- 
pare NMFI and PLM under the S? N score. All figures 
in this paragraph show the best regularization for each 
method, unless otherwise stated. Figure [9] shows score vs. 
distance for all pairs in all families. Unlike Fig. 21 the 
two plots now show very similar profiles. We note, how- 
ever, that NMFI's Sfj N scores trend two to three times 
larger than PLM's (the scales on the vertical axes are 
different). Perhaps this is an inherent feature of these 
methods, or simply a consequence of the different types 
of regularization. 

Figure [10] shows the same situation as Fig. [3] but us- 
ing Sy to score NMFI. The three best regularization 
choices for NMFI-CN turned out the same as before, i.e., 
A = 1 • B e ff, A = 1.5 • B e fj and A = 2.3 • B e fj, but 
the best out of these three was now A = 2.3 • B e // (in- 
stead of A = 1 • B e ff), Comparing Fig. [3] and Fig. [TO] one 
can see that the difference between the two methods is 
now smaller; for several families, the prediction quality 



FIG. 10. (Color online) Contact-detection results for all the 
families in our study (sorted by B e ff), now with the <Sjj 
score for NMFI. The y-axes are TP rates and the x-axes 
are the number of predicted contacts p, based on pairs with 
\j — i\ > 4. The three curves for each method are the three 
regularization levels yielding highest TP rates across all fam- 
ilies. The thickened curve highlights the best one out of these 
three (A = 2.3 • B eff for NMFI-CN and Aj = 0.01 for PLM). 

Figure [11] shows scatter plots for the same families as 
in Fig.[S]but using the scoring for NMFI. The points 
now lie more clearly on a line, from which we conclude 
that the bends in Fig. [5] were likely a consequence of 
differing scores. Yet, the trends seen in Fig. [5] remain: 
NMFI gives more attention to neighbor pairs than does 
PLM. 

In Fig. [T2l we recreate the contact maps of Fig. [6]with 
NMFI-CN in place of NMFI-DI and find that the plots 
are more symmetric. As expected, asymmetry is seen 
primarily for small \j — i\; NMFI tends to crowd these 
regions with lots of loops. 

To investigate why NMFI assembles so many top- 
scored pairs in certain neighbor regions, we performed 
an explicit check of the associated MSA columns. A rel- 
evant regularity was observed: when gaps appear in a 
sequence, they tend to do so in long strands. The pic- 
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FIG. 9. (Color online) Score plotted against distance for all position pairs in all 17 families for PLM (left) and NMFI-CN 
(right). The vertical line is our contact cutoff at 8.5A. 



ture can be illustrated by the following hypothetical MSA 
(in our implementation, the gap state is 1): 



I '• \ 

••• 6 5 9 7 2 6 8 7 4 4 2 2--- 

••• 1 1 1 1 1 1 1 1 1 1 2 8--- 

••• 6 5 2 7 2 3 8 9 5 4 2 3--- 

■•■ 37472687942 3--- 

■•■ 37472388942 9--- 

••• 1 1 1 1 1 1 1 4 5 4 2 9- •• 

••• 85972987442 4--- 

••• 11111111112 A--- 

V ••• / 



We recall that gaps (T' states) are necessary for sat- 
isfactory alignment of the sequences in a family and 
that in our procedure we treat gaps just another amino 
acid, with its associated interaction parameters. We then 
make the obvious observation that independent samples 
from a Potts model will only contain long subsequences 
of the same state with low probability. In other words, 
the model to which we fit the data cannot describe long 
stretches of T' states, which is a feature of the data. It is 
hence quite conceivable that the two methods handle this 
discrepancy between data and models differently since we 
do expect this gap effect to generate large 1) for at 

least some pairs with small \j — i\. 



« Neighbor pair: |i-j|<5 




o Contact: 0< true distance < £ 


.5 (and |i-j| > 5) 


■ Non contact: true distance > 


8.5 (and |i-j| > 5) 




2 4 1 2 3 4 
NMFI-QN ISM, , , , , , 




01234 01234 



FIG. 11. (Color online) Scatter plots of interaction scores for 
PLM and NMFI-CN from four families. For all plots, the axes 
are as indicated by the top left figure. The distance unit in 
the top box is A. 
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PF00028 PF00035 








FIG. 12. (Color online) Predicted contact maps for PLM 
(right part, black symbols) and NMFI-CN (left part, green 
symbols online) from four families. A pair (i,j)'s placement 
in the plots is found by matching positions i and j on the 
axes. Contacts are indicated by gray (dark for d(i,j) < 5A 
and light for 5A< d(i,j) < 8.5A). True and false positives are 
represented by circles and crosses, respectively. Each figure 
shows the 1.5N strongest ranked pairs (including neighbors) 
for that family. 



2 - 
1.5 
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FIG. 13. Scatter plots of estimated Jij,kl = Jij(k,l) from 
PF00014 (top) and PF00043 (bottom). Black dots are 'gap- 
gap' interactions (k = I = 1), dark gray dots are 'gap-amino- 
acid' interactions (k = 1 and I ^ 1, or k ^ 1 and 1 = 1), 
and light gray dots are 'amino-acid-amino-acid' interactions 
(k 1 and 1). 



Figure [13] shows scatter plots for all coupling param- 
eters Jij(k,l) in PF00014, which has a modest amount 
of gap sections, and in PF00043, which has relatively 
many. As outlines above, the Jy (l, l)-parameters are 
among the largest in magnitude, especially for PF00043. 
We also note that the black dots steer to the right; NMFI 
clearly reacts more strongly to the gap-gap interactions 
than PLM. 



Jones et al. [9( disregarded contributions from gaps in 
their scoring by simply skipping the gap state when do- 
ing their norm summations. We tried this but found no 
significant improvement for cither method. The change 
seemed to affect only pairs with small \j — i\ (which is 
reasonable), and our TP rates are based on pairs with 
\j — i\ > 4. If gap interactions are indeed responsible for 
reduced prediction qualities, removing their input during 
scoring is just a Band-Aid type solution. A better way 
would be to suppress them already in the parameter es- 
timation step. That way, all interplay would have to be 
accounted for without them. Whether or not there are 
ways to effectively handle the inference problem in PSP 
by ignoring gaps or treating them differently, is an issue 
which goes beyond the scope of this work. 

We also investigated whether the gap effect depends on 
the sequence similarity reweighting factor x, which up to 
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here was chosen x — 0.9. Perhaps the gap effect can 
be dampened by stricter definition of sequence unique- 
ness? In Fig. [14] we show another set of TP rates, but 
now for x — 0.75. We also include results for NMFI run 
on alignment files from which all sequences with more 
than 20% gaps have been removed. The best regulariza- 
tion choice for each method turned out the same as in 
Fig. [[DJ X = 2.3 • B eff for NMFI-CN and Xj = 0.01 for 
PLM. Overall, PLM maintains the same advantage over 
NMFI-CN it had in Fig. [TU] Removing gappy sequences 
seems to trim down more TP rates than it raises, prob- 
ably since useful information in the nongappy parts is 
discarded unnecessarily. 
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the wider range of 50-400. The list is given in Table [III 
We here kept the reweighting level at x = 0.8 as in @], 
while the TP rates were again calculated using the cutoff 
8.5A. The pseudocount strength for NMFI was varied in 
the same interval as in the main text. We did not try 
to optimize the PLM regularization parameters for this 
trial, but merely used Xh = Xj = 0.01 as determined for 
the smaller families in the main text. 



Figure [15] shows qualitatively the same behavior as in 
the smaller set of families: TP rates increase partly from 
changing from the S® 1 score to the N score, and partly 
from changing from NMFI to PLM. Our positive results 
thus do not seem to be particular to short-length families. 



Apart from the average TP rate for each value of p 
(p'th strongest predicted interactions) one can also eval- 
uate performance by different criteria. In this larger sur- 
vey we investigated the distribution of values of p such 
that the TP rate in a family is one. Fig. [T5] shows the 
histograms of the number of families for which the top 
p predictions are correct, clearly showing that the differ- 
ence between PLM and NMFI (using the two scores) pri- 
marily occurs at the high end. The difference in average 
performance between PLM and NMFI at least partially 
stems from PLM getting more strongest contact predic- 
tions with 100% accuracy. 
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FIG. 14. (Color online) Contact-detection results for all the 
families in our study. The y-axes are TP rates and x-axes are 
the number of predicted contacts p, based on pairs with \j — 
i\ > 4. Two curves are included for reweighting margin x — 
0.75, and one for reweighting margin x — 0.9 after deletion 
of all sequences with more than 20% gaps. Results for each 
method correspond to the regularization level yielding highest 
TP rates across all families (A = 2.3-B eff for both NMFI-CN 
and Xj = 0.01 for PLM). 




Number of predicted pairs 



Appendix C: Extension to 28 protein families 

To sample a larger set of families, we conducted an ad- 
ditional survey of 28 families, now covering lengths across 



FIG. 15. (Color online) Mean TP rates over the larger set of 
28 families for PLM with A j = 0.01 and varying regularization 
values for NMFI-CN and NMFI-DI. 
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FIG. 16. (Color online) Distribution of 'perfect' accuracy for 
the three methods. The x-axis shows the number of top- 
ranked pairs for which the TP rates stays at one, and the 
y-axis shows the number of families. 
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ID 


N 


B 


Beff (80%) 


PDB ID 


UniProt entry 


UniProt residues 


PF00006 


215 


10765 


641 


2r9v 


ATPA.THEMA 


149-365 


PF00011 


102 


5024 


2725 


2bol 


TSP36.TAESA 


106-206 


PF00014 


53 


2393 


1478 


5pti 


BPTl_BOVIN 


39-91 


PF00017 


77 


2732 


1312 


lo47 


SRCLHUMAN 


151-233 


PF00018 


48 


5073 


335 


2hda 


YES.HUMAN 


97-144 


PF00025 


175 


2946 


996 


lfzq 


ARL3.MOUSE 


3-177 


PF00026 


314 


3851 


2075 


3er5 


CARP.CRYPA 


105-419 


PF00027 


91 


12129 


7631 


3£hi 


KAP0_BOVIN 


154-238 


PF00028 


93 


12628 


6323 


2o72 


CADH1JHUMAN 


267-366 


PF00032 


102 


14994 


684 


lzrt 


CYB_RHOCA 


282-404 


PF00035 


67 


3093 


1826 


loOw 


RNC.THEMA 


169-235 


PF00041 


85 


15551 


8691 


lbqu 


IL6RB .HUMAN 


223-311 


PF00043 


95 


6818 


4052 


6gsu 


GSTMURAT 


104-192 


PF00044 


151 


6206 


1422 


lcrw 


G3P.PANVR 


1-148 


PF00046 


57 


7372 


1761 


2vi6 


NANOG.MOUSE 


97-153 


PF00056 


142 


4185 


1120 


la5z 


LDH.THEMA 


1-140 


PF00059 


108 


5293 


3258 


Hit 


REG1AJHUMAN 


53-164 


PF00071 


161 


10779 


3793 


5p21 


RASH_HUMAN 


5-165 


PF00073 


171 


9524 


487 


2r06 


POLG.HRV14 


92-299 


PF00076 


70 


21125 


10113 


lg2e 


ELAV4_HUMAN 


48-118 


PF00081 


82 


3229 


890 


3bfr 


SODM.YEAST 


27-115 


PF00084 


56 


5831 


3453 


lelv 


C1S.HUMAN 


359-421 


PF00085 


104 


10569 


6137 


3gnj 


VWFJIUMAN 


1691-1863 


PF00091 


216 


8656 


917 


2r75 


FTSZ.AQUAE 


9-181 


PF00092 


179 


3936 


1786 


latz 


VWFJIUMAN 


1691-1863 


PF00105 


70 


2549 


816 


lgdc 


GCR.RAT 


438-507 


PF00108 


264 


6839 


2688 


3goa 


FADAJ3ALTY 


1-254 



TABLE II. Domain families included in our extended study, listed with Pfam ID, length N, number of sequences B (after 
removal of duplicate sequences), number of effective sequences -Be// (under x = 0.8, i.e., 80% threshold for reweighting) , and 
the PDB and UniProt specifications for the structure used to access the DCA prediction quality. 



